The goal of this project is to explore the chemical properties in a tidy data set of white wines and to summarize which properties are most closely related to the wine quality score.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The first column, X, is a unique index for each observation. It is not useful for the analysis, so it is best to remove it before proceeding to the next step.
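Dropping the index column can be sketched as follows. This is a minimal illustration on a toy data frame built from the first five rows of the `str()` output above, not the full data set; the object name `wines` is taken from the test output later in this report.

```r
# Toy stand-in for the wine data: the first five rows of three columns,
# copied from the str() output above.
wines <- data.frame(
  X             = 1:5,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2),
  quality       = c(6L, 6L, 6L, 6L, 6L)
)

# X only repeats the row number, so it carries no information of its own.
stopifnot(identical(wines$X, seq_len(nrow(wines))))

# Assigning NULL to a column removes it from the data frame.
wines$X <- NULL

names(wines)
```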
We already know that wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent).
The lowest and highest quality scores given are 3 and 9, with a mean of 5.878.
All of the attributes are stored as decimal values except quality, which is an integer.
Attributes like fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates have a maximum far above the third quartile (75% quantile), which suggests long right tails or outliers.
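One way to make "far above the third quartile" concrete is the conventional boxplot fence, Q3 + 1.5 × IQR. The sketch below applies it to the quartiles and maxima copied from the `summary()` output above. The 1.5 × IQR rule is an assumption for illustration; the report does not state which cutoff was used to call a value an outlier.

```r
# Quartiles and maxima copied from the summary() output above.
s <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "sulphates"),
  q1  = c(6.3, 0.21, 0.27, 1.7, 0.036, 23, 108, 0.41),
  q3  = c(7.3, 0.32, 0.39, 9.9, 0.050, 46, 167, 0.55),
  max = c(14.2, 1.10, 1.66, 65.8, 0.346, 289, 440, 1.08)
)

# Upper boxplot fence: third quartile plus 1.5 times the interquartile range.
s$upper_fence <- s$q3 + 1.5 * (s$q3 - s$q1)

# TRUE when the largest observation lies beyond the fence.
s$max_beyond_fence <- s$max > s$upper_fence

print(s)
```

For every attribute listed, the maximum lies beyond the fence, which is consistent with the long right tails seen in the histograms.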
Let's take a look at each property to get a sense of how the data is distributed.
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
## 0 1 2 3 4 5 6 7 8 9 10
## 0 0 0 20 163 1457 2198 880 175 5 0
The mode of the quality score is 6, which matches the table of quality_as_factor above.
I converted the wine quality from an integer variable to a categorical variable (quality_as_factor), because the numeric scale is somewhat arbitrary and the scores could just as well be represented by ordered labels. The histogram is roughly bell-shaped.
The range of possible scores is 0 to 10, but in the dataset the minimum score is 3 and the maximum is 9. The mean (5.878) and the median (6) are very close to each other.
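The quality figures above can be reproduced from the frequency table alone. The sketch below rebuilds the quality column from the printed counts (an approach chosen here for self-containment; in the analysis itself the column comes from the data set) and then derives the factor, the summary statistics and the mode.

```r
# Rebuild the quality column from the frequency table printed above.
counts  <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
             `7` = 880, `8` = 175, `9` = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# Keep all eleven possible scores as levels, even the unused ones.
quality_as_factor <- factor(quality, levels = 0:10)

table(quality_as_factor)
summary(quality)

# The mode is the level with the highest count.
quality_mode <- names(which.max(table(quality_as_factor)))
quality_mode
```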
##
## 3.8 3.9 4.2 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5
## 1 1 2 3 1 1 5 9 7 24 23 28 27 28 31
## 5.6 5.7 5.8 5.9 6 6.1 6.15 6.2 6.3 6.4 6.45 6.5 6.6 6.7 6.8
## 71 88 121 103 184 155 2 192 188 280 1 225 290 236 308
## 6.9 7 7.1 7.15 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8 8.1 8.2
## 241 232 200 2 206 178 194 123 153 93 93 74 80 56 56
## 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7
## 52 35 32 25 15 18 16 17 6 21 3 11 2 5 4
## 9.8 9.9 10 10.2 10.3 10.7 11.8 14.2
## 8 2 3 1 2 2 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The initial histogram showed a right tail. Once the outliers at 11.8 and 14.2 are removed, the histogram looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.09700 -0.67780 -0.58500 -0.58090 -0.49490 0.04139
By transforming the variable with a base-10 logarithm, we are able to remove the right tail and obtain an approximately normal distribution.
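The effect of the transformation can be checked against the two summaries above. Because `log10` is monotone, it maps the minimum, quartiles and maximum of the raw variable onto those of the transformed variable (up to quantile interpolation), while the mean is not preserved. The sketch below uses only the five numbers printed for volatile.acidity; the `tail_ratio` helper is a hypothetical name introduced for illustration.

```r
# Five-number summary of volatile.acidity, copied from the output above.
raw <- c(min = 0.08, q1 = 0.21, median = 0.26, q3 = 0.32, max = 1.10)

# log10 is monotone, so each order statistic maps to its transformed counterpart.
logged <- log10(raw)
round(logged, 4)

# Length of the upper tail relative to the lower tail, measured from the median.
# A value near 1 means the two tails are about equally long.
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

tail_ratio(raw)     # upper tail several times longer than the lower tail
tail_ratio(logged)  # the two tails are much closer in length
```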
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Using a box plot, we can clearly see the outliers that give the histogram its long right tail. Once we remove them, the histogram looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2218 0.2304 0.7160 0.6432 0.9956 1.8180
I removed the outliers after looking at the boxplot. The histogram is positively skewed, and even after applying a base-10 log transformation it does not look normal. It looks more like a bimodal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0460 -1.4440 -1.3670 -1.3680 -1.3010 -0.4609
Chlorides appears normally distributed once a base-10 log transformation is applied.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Once outliers are removed, the histogram appears normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The same holds for total.sulfur.dioxide: once the right-tail outliers are removed, the histogram appears normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH histogram already looks normally distributed, so no transformation needs to be applied.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.65760 -0.38720 -0.32790 -0.32100 -0.25960 0.03342
Applying a base-10 log transformation gives us a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The density histogram shows a normal distribution once the outlier is removed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
## 3 4 5 6 7 8 9
## 10.34500 10.15245 9.80884 10.57537 11.36794 11.63600 12.18000
This histogram shows a slightly positively skewed distribution with a peak between 9 and 10.
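As a consistency check, the per-score means printed above can be combined with the quality counts to recover the overall mean alcohol level. This sketch assumes the second output above is the mean alcohol per quality score (for example from something like `tapply(wines$alcohol, wines$quality, mean)`; the exact call is not shown in the report).

```r
# Mean alcohol per quality score and the number of wines per score,
# both copied from the outputs above.
mean_alcohol <- c(`3` = 10.34500, `4` = 10.15245, `5` = 9.80884,
                  `6` = 10.57537, `7` = 11.36794, `8` = 11.63600,
                  `9` = 12.18000)
n_wines      <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                  `7` = 880, `8` = 175, `9` = 5)

# Weighting each group mean by its group size recovers the overall mean.
overall_mean <- weighted.mean(mean_alcohol, w = n_wines)
round(overall_mean, 2)
```

The result agrees with the mean of 10.51 in the alcohol summary. The group means also show that, from score 5 upwards, mean alcohol rises with the quality score.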
There are 4,898 observations and 13 variables in the dataset. The first variable, X, is simply an index of the observations; it was removed because it is not useful for the analysis. The remaining numeric variables are chemical properties of white wine. The last variable is an integer quality score, rated by wine experts on a scale from 0 (very bad) to 10 (excellent), which I converted to a factor. The most common quality score is 6, and the lowest and highest scores given are 3 and 9.
The main feature of interest in this analysis is wine quality. The wine properties need to be analyzed to see whether they have an impact on this outcome.
Let's take a look at how strongly each variable correlates with quality:
## [,1]
## fixed.acidity -0.113662831
## volatile.acidity -0.194722969
## citric.acid -0.009209091
## residual.sugar -0.097576829
## chlorides -0.209934411
## free.sulfur.dioxide 0.008158067
## total.sulfur.dioxide -0.174737218
## density -0.307123313
## pH 0.099427246
## sulphates 0.053677877
## alcohol 0.435574715
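The variables most correlated with quality are alcohol (0.436), density (-0.307), chlorides (-0.210), volatile.acidity (-0.195) and total.sulfur.dioxide (-0.175).
We will concentrate on the variables at the top of this list, which show the strongest correlation.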
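To pick out the top variables, the correlations can be ranked by absolute value so that strong negative relationships are not overlooked. The values below are copied from the output above; with the full data a one-column matrix like that output could come from a call such as `cor(wines[, names(wines) != "quality"], wines$quality)`, although the report does not show the exact call.

```r
# Correlation of each property with quality, copied from the output above.
cor_with_quality <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715
)

# Rank by absolute value, strongest first.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
round(ranked, 3)
```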
Since the mode of the quality score is 6, we can treat 6 as the average score on the 0 to 10 scale and cut the scores into groups around it.
##
## (0,5] (5,6] (6,10]
## 1640 2198 1060
After the cut we have three groups of wines. The first group has a quality score from 0 to 5, which we can consider the bad quality group. The second group, with a score of 6, is the average quality group, and the last group, with scores from 7 to 10, is the best quality group. We will use these groups in our analysis.
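The grouping can be reproduced with `cut()`, using the break points implied by the interval labels in the output above. The quality column is rebuilt here from the frequency table shown earlier, and the variable name `quality.cut` is an assumption, since the report does not show the name used.

```r
# Rebuild the quality column from the frequency table shown earlier.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Right-closed intervals: (0,5] is bad, (5,6] is average, (6,10] is best.
quality.cut <- cut(quality, breaks = c(0, 5, 6, 10))

table(quality.cut)
```

Because the intervals are closed on the right, a score of 5 falls in the bad group and a score of 6 in the average group.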
I created histograms to understand the distribution of each feature and box plots to find outliers. There were a few outliers, which I removed so that the distributions look closer to a Gaussian (normal) distribution. I also applied a base-10 log transformation to the features with long tails so that they become approximately Gaussian. quality and pH were the features I did not transform, as they already look normally distributed. The alcohol feature looks positively skewed.
I plotted a ggpairs scatterplot matrix to look into the relationships between the variables. The variables I focus on are alcohol, pH, chlorides and density: alcohol, density and chlorides have the largest correlations with quality, and pH has the largest positive correlation after alcohol. But let's plot all the other variables as well.
The boxplots already show a trend, which is better illustrated with a scatter plot and a linear regression line. Good wines tend to have a higher alcohol level and a higher pH.
We explore further to find out how the main features relate to the secondary features.
We can see that density, residual.sugar, chlorides, total.sulfur.dioxide and free.sulfur.dioxide are negatively correlated with alcohol level. Only pH shows a positive correlation.
## [1] -0.4258583
##
## Pearson's product-moment correlation
##
## data: wines$pH and wines$residual.sugar
## t = -13.8472, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2209387 -0.1670352
## sample estimates:
## cor
## -0.1941335
I would certainly expect fixed.acidity to be correlated with pH: in chemistry, when acidity goes down, the pH value goes up. There is no clear pattern between residual.sugar and pH, even though they are negatively correlated.
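The numbers in the Pearson test output for pH and residual.sugar can be reproduced from the correlation and the degrees of freedom alone, which shows why such a weak correlation still has a tiny p-value with 4,898 observations. The sketch below uses the standard t statistic for a correlation and the Fisher z-transform for the confidence interval; the inputs are copied from the output above.

```r
# Correlation and degrees of freedom copied from the cor.test() output above.
r  <- -0.1941335
df <- 4896
n  <- df + 2

# t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transform.
z    <- atanh(r)
half <- qnorm(0.975) / sqrt(n - 3)
ci   <- tanh(c(z - half, z + half))

round(t_stat, 4)
round(ci, 4)
```

The large sample size drives the t statistic far from zero, so statistical significance here says little about how strong the relationship is.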
##
## Pearson's product-moment correlation
##
## data: wines$free.sulfur.dioxide and wines$total.sulfur.dioxide
## t = 54.6447, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
Besides the main and secondary features, other features such as free.sulfur.dioxide and total.sulfur.dioxide are correlated with one another.
##
## Pearson's product-moment correlation
##
## data: wines$density and wines$residual.sugar
## t = 107.8749, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
When the top 1% of observations is excluded, plotting residual.sugar against density shows a strong correlation.
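Excluding the top 1% can be sketched with `quantile()`. The example below runs on synthetic data, not the wine data, and the object name `wines_demo` and the generating parameters are made up for illustration. It also assumes the top 1% of both variables is dropped, which the report does not specify.

```r
set.seed(42)

# Synthetic stand-in for the two columns; NOT the real wine data.
n <- 1000
residual.sugar <- rexp(n, rate = 1 / 6)
density        <- 0.991 + 0.0004 * residual.sugar + rnorm(n, sd = 0.001)
wines_demo     <- data.frame(residual.sugar, density)

# 99th percentile of each variable.
sugar_cap   <- quantile(wines_demo$residual.sugar, 0.99)
density_cap <- quantile(wines_demo$density, 0.99)

# Keep only the rows below both caps.
trimmed <- subset(wines_demo,
                  residual.sugar < sugar_cap & density < density_cap)

# Compare the correlation before and after trimming.
cor(wines_demo$residual.sugar, wines_demo$density)
cor(trimmed$residual.sugar, trimmed$density)
```

Trimming changes which points are plotted and can shift the correlation slightly, so it is worth stating whether a reported value was computed on the full or the trimmed data.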
Our main goal was to find out which features affect the quality score. The main relationships observed were between alcohol and quality and between pH and quality. Besides that, density and chlorides have a negative influence on quality.
Another interesting relationship is between free.sulfur.dioxide and total.sulfur.dioxide, with a correlation of 0.616. The two are related because total sulfur dioxide is the sum of the free and bound forms of sulfur dioxide gas (SO2).
The strongest relationship found was between density and residual.sugar, with a correlation of 0.839.
Since density and residual.sugar have the strongest relationship, I explored them further together with alcohol. In the plot, the color that encodes alcohol turns darker as density and residual.sugar increase, which means alcohol decreases.
When I swapped alcohol for my newly defined quality cut variable, the result was more obvious. Wines with better quality scores are concentrated on the left-hand side of the plot, whereas wines with bad quality scores are on the right.
Initially, I thought pH had a small correlation with quality, so pH vs fixed.acidity should be somewhat linked with quality. After plotting it, there seems to be no strong pattern that we can identify from pH vs fixed.acidity vs quality.
The relationship between density, residual.sugar and alcohol is quite interesting. Since all of them are highly correlated with one another, it is easy to spot the change when any one of the variables varies. When we swapped alcohol for the quality score, the same pattern could be spotted as well.
As mentioned above, I thought pH would show some interaction with quality, but after plotting it, this is hard to identify from the plot.
When the histogram is faceted by the quality score cut, we can see that bad quality wines are positively skewed, concentrated at lower alcohol levels. Average quality wines show a more normal-shaped distribution of alcohol. Best quality wines show a negatively skewed distribution, concentrated at higher alcohol levels.
This graph shows that as density and residual sugar increase, alcohol level decreases. This can be seen most clearly in the best quality wines, where the color goes from light to dark. The same color variation appears in the average and bad quality wines, but it is less pronounced.
In this graph, using the newly defined quality score cut, we can see that the best quality wines (scores 7 to 10) concentrate in the lower right quadrant, where density is low and alcohol level is high. Bad quality wines (scores 0 to 5) are concentrated in the upper left quadrant, with high density and low alcohol level.
I encountered two difficulties during the analysis. One was the lack of categorical variables: the only categorical variable in the dataset is quality, and more variables like it would have allowed us to identify more relationships between the variables using subsets. The other was the lack of correlation between variables: some variables show very low correlation with every other variable.
Even with these limitations in the dataset, we were still able to discover interesting findings, such as the relationship between alcohol, density and residual.sugar. From the scatterplot matrix, I was only able to identify a few features that influence the quality score.
Many other factors can determine a good wine. Properties such as smell and flavour, which are not chemical properties, could be documented in the dataset to allow us to explore further what makes a good quality wine.